Wine Quality Analysis Based on Chemical Content

In this report we analyze in detail the chemical composition of white wine and its effect on quality.

Summary information on different variables

## [1] 4898   13
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

There are a total of 4898 wine samples.
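The output above looks like the result of `dim()`, `summary()` and `str()` on the wine data frame. As a minimal sketch, the same calls are shown here on the first three rows exactly as `str()` prints them (density appears rounded there). The name `wine_head` and the CSV file name in the final comment are assumptions; the source does not show them.

```r
# First three rows as displayed by str() above (density is shown rounded there).
wine_head <- data.frame(
  X                    = 1:3,
  fixed.acidity        = c(7, 6.3, 8.1),
  volatile.acidity     = c(0.27, 0.30, 0.28),
  citric.acid          = c(0.36, 0.34, 0.40),
  residual.sugar       = c(20.7, 1.6, 6.9),
  chlorides            = c(0.045, 0.049, 0.050),
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97),
  density              = c(1.001, 0.994, 0.995),
  pH                   = c(3, 3.3, 3.26),
  sulphates            = c(0.45, 0.49, 0.44),
  alcohol              = c(8.8, 9.5, 10.1),
  quality              = c(6L, 6L, 6L)
)

dim(wine_head)      # number of rows and columns
str(wine_head)      # column types and first values
summary(wine_head)  # quartiles and mean for every column

# With the full data set the same three calls apply. The file name is an assumption:
# wine <- read.csv("wineQualityWhites.csv")
```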

Univariate Data Analysis

Histogram Analysis

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most of the wines in this data set have a quality score of 5, 6 or 7. No wine has a quality score below 3, and none has a score of 10.
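To put a number on "most", the counts from the frequency table can be turned into shares. This is a small sketch that re-enters the printed counts by hand; the variable names are illustrative only.

```r
# Counts copied from the quality frequency table above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)

total <- sum(quality_counts)                      # should match the 4898 samples
share <- round(100 * quality_counts / total, 1)   # percentage per quality score

# Share of wines scored 5, 6 or 7
mid_share <- sum(quality_counts[c("5", "6", "7")]) / total

print(share)
print(mid_share)
```

Scores 5 to 7 together cover a bit more than nine wines in ten.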

Fixed acidity has an approximately normal distribution.

Volatile acidity has an approximately normal distribution.

Citric acid has an approximately normal distribution.

Chlorides have an approximately normal distribution.

pH has an approximately normal distribution.

Sulphates have an approximately normal distribution.

Density has an approximately normal distribution.

Residual sugar was log-transformed to reveal the different modes present in it. On the log scale its distribution is bimodal: most wines cluster around two sugar values.
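The effect of the log transform can be illustrated without the wine data. The sketch below uses simulated values (two log-normal clusters, not the real residual sugar column) and base R `hist()`; the report's own plotting code is not shown in the source, so this is only an assumption about the general approach.

```r
set.seed(1)

# Simulated stand-in for residual sugar: two log-normal clusters.
# These values are made up for illustration and are NOT the wine data.
sugar_sim <- c(10^rnorm(1000, mean = log10(1.5), sd = 0.12),
               10^rnorm(1000, mean = log10(10),  sd = 0.12))
log_sugar <- log10(sugar_sim)

# On the raw scale the low cluster is squeezed into a few bins;
# on the log10 scale both clusters get comparable width.
raw_hist <- hist(sugar_sim, breaks = 30, plot = FALSE)
log_hist <- hist(log_sugar, breaks = seq(-1, 2, by = 0.1), plot = FALSE)

# Count observations near each peak and in the valley between them.
in_range <- function(lo, hi) sum(log_sugar > lo & log_sugar <= hi)
peak_low  <- in_range(0.1, 0.3)
valley    <- in_range(0.5, 0.7)
peak_high <- in_range(0.9, 1.1)

print(c(peak_low = peak_low, valley = valley, peak_high = peak_high))
```

Two well-populated peaks separated by a nearly empty valley is what "bimodal" means here.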

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Free sulfur dioxide and total sulfur dioxide both have outliers. For free sulfur dioxide the median is 34 and the third quartile is 46, but the maximum is 289. Similarly, for total sulfur dioxide the median is 134 and the third quartile is 167, but the maximum is 440.
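One common way to make the outlier claim concrete is the 1.5 x IQR rule used by boxplots. The source does not say which rule, if any, was applied, so this is a sketch under that assumption, using only the quartiles printed above.

```r
# Upper boxplot fence: third quartile plus 1.5 times the interquartile range.
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

free_fence  <- upper_fence(q1 = 23,  q3 = 46)    # free sulfur dioxide
total_fence <- upper_fence(q1 = 108, q3 = 167)   # total sulfur dioxide

# Maxima from the summaries above
free_max  <- 289
total_max <- 440

print(c(free_fence = free_fence, free_max = free_max))      # 80.5 vs 289
print(c(total_fence = total_fence, total_max = total_max))  # 255.5 vs 440
```

Both maxima lie far above their fences, which supports treating them as outliers.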

Alcohol appears to have a roughly uniform distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1    1225    2450    2450    3674    4898

X is the row index of each sample (it runs from 1 to 4898), so it is difficult to read any information from it without a transformation. After a log transformation, X has a left-skewed (negatively skewed) distribution.
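The direction of the skew can be checked directly, assuming X is simply the sequence 1 to 4898 as the summary suggests. The `skewness()` helper below is a plain moment-based definition written for this sketch, not a function from the report.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

x <- 1:4898                    # X as a row index (assumption based on the summary)

skew_raw <- skewness(x)        # symmetric sequence, so essentially zero
skew_log <- skewness(log(x))   # negative: long tail toward small values

print(c(raw = skew_raw, log = skew_log))
```

A negative value means the long tail is on the left, i.e. the log-transformed index is negatively skewed.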

What is the structure of the data set?

The data set has 4898 wines with 13 features (X, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). The following observations were made about the data:

* Most of the wines have a quality score of 5, 6 or 7.
* No wine has a quality score below 3 or a score of 10.
* Alcohol is roughly uniformly distributed.
* X is negatively skewed after the log transformation.
* Residual sugar is bimodal after the log transformation.
* Free sulfur dioxide and total sulfur dioxide have large outliers.
* The remaining chemical features (fixed acidity, volatile acidity, citric acid, chlorides, pH, sulphates, density) are approximately normally distributed.

What are the main features of interest?

Quality is the main feature of interest; I will investigate how the other features influence it. I strongly suspect that residual sugar has some relationship with wine quality, and it also makes sense for alcohol content to be related to quality. From the univariate analysis alone it is very difficult to establish any relationship between quality and the other features.

What unusual patterns did you see in your univariate data analysis? Did you make any adjustments to tidy the data or transform its shape? If so, why?

With default settings the histogram of X had an unusual shape, so I applied a log transform, which gave a left-skewed (negatively skewed) distribution; X itself has a median of 2450. Residual sugar was also log-transformed, which changed its shape from right-skewed to bimodal, with most wines falling around 3 or 9. The bin widths were adjusted in all the histograms.

Bivariate Data Analysis

Scatterplot Analysis

## 
## Two-Step Estimates
## 
## Correlations/Type of Correlation:
##                             X fixed.acidity volatile.acidity citric.acid
## X                           1       Pearson          Pearson     Pearson
## fixed.acidity         -0.2558             1          Pearson     Pearson
## volatile.acidity     0.002858       -0.0227                1     Pearson
## citric.acid           -0.1499        0.2892          -0.1495           1
## residual.sugar       0.006624       0.08902          0.06429     0.09421
## chlorides            -0.04565       0.02309          0.07051      0.1144
## free.sulfur.dioxide  -0.01193       -0.0494         -0.09701     0.09408
## total.sulfur.dioxide   -0.162       0.09107          0.08926      0.1211
## density                -0.186        0.2653          0.02711      0.1495
## pH                    -0.1158       -0.4259         -0.03192     -0.1637
## sulphates            0.009808      -0.01714         -0.03573     0.06233
## alcohol                0.2137       -0.1209          0.06772    -0.07573
## quality               0.03576       -0.1137          -0.1947   -0.009209
##                      residual.sugar chlorides free.sulfur.dioxide
## X                           Pearson   Pearson             Pearson
## fixed.acidity               Pearson   Pearson             Pearson
## volatile.acidity            Pearson   Pearson             Pearson
## citric.acid                 Pearson   Pearson             Pearson
## residual.sugar                    1   Pearson             Pearson
## chlorides                   0.08868         1             Pearson
## free.sulfur.dioxide          0.2991    0.1014                   1
## total.sulfur.dioxide         0.4014    0.1989              0.6155
## density                       0.839    0.2572              0.2942
## pH                          -0.1941  -0.09044          -0.0006178
## sulphates                  -0.02666   0.01676             0.05922
## alcohol                     -0.4506   -0.3602             -0.2501
## quality                    -0.09758   -0.2099            0.008158
##                      total.sulfur.dioxide  density      pH sulphates
## X                                 Pearson  Pearson Pearson   Pearson
## fixed.acidity                     Pearson  Pearson Pearson   Pearson
## volatile.acidity                  Pearson  Pearson Pearson   Pearson
## citric.acid                       Pearson  Pearson Pearson   Pearson
## residual.sugar                    Pearson  Pearson Pearson   Pearson
## chlorides                         Pearson  Pearson Pearson   Pearson
## free.sulfur.dioxide               Pearson  Pearson Pearson   Pearson
## total.sulfur.dioxide                    1  Pearson Pearson   Pearson
## density                            0.5299        1 Pearson   Pearson
## pH                               0.002321 -0.09359       1   Pearson
## sulphates                          0.1346  0.07449   0.156         1
## alcohol                           -0.4489  -0.7801  0.1214  -0.01743
## quality                           -0.1747  -0.3071 0.09943   0.05368
##                      alcohol quality
## X                    Pearson Pearson
## fixed.acidity        Pearson Pearson
## volatile.acidity     Pearson Pearson
## citric.acid          Pearson Pearson
## residual.sugar       Pearson Pearson
## chlorides            Pearson Pearson
## free.sulfur.dioxide  Pearson Pearson
## total.sulfur.dioxide Pearson Pearson
## density              Pearson Pearson
## pH                   Pearson Pearson
## sulphates            Pearson Pearson
## alcohol                    1 Pearson
## quality               0.4356       1
## 
## Standard Errors:
##                            X fixed.acidity volatile.acidity citric.acid
## X                                                                      
## fixed.acidity        0.01336                                           
## volatile.acidity     0.01429       0.01428                             
## citric.acid          0.01397        0.0131          0.01397            
## residual.sugar       0.01429       0.01418          0.01423     0.01416
## chlorides            0.01426       0.01428          0.01422      0.0141
## free.sulfur.dioxide  0.01429       0.01426          0.01416     0.01416
## total.sulfur.dioxide 0.01392       0.01417          0.01418     0.01408
## density               0.0138       0.01328          0.01428     0.01397
## pH                    0.0141        0.0117          0.01428     0.01391
## sulphates            0.01429       0.01429          0.01427     0.01423
## alcohol              0.01364       0.01408          0.01422     0.01421
## quality              0.01427       0.01411          0.01375     0.01429
##                      residual.sugar chlorides free.sulfur.dioxide
## X                                                                
## fixed.acidity                                                    
## volatile.acidity                                                 
## citric.acid                                                      
## residual.sugar                                                   
## chlorides                   0.01418                              
## free.sulfur.dioxide         0.01301   0.01414                    
## total.sulfur.dioxide        0.01199   0.01372            0.008878
## density                    0.004233   0.01334             0.01305
## pH                          0.01375   0.01417             0.01429
## sulphates                   0.01428   0.01429             0.01424
## alcohol                     0.01139   0.01244              0.0134
## quality                     0.01415   0.01366             0.01429
##                      total.sulfur.dioxide  density      pH sulphates
## X                                                                   
## fixed.acidity                                                       
## volatile.acidity                                                    
## citric.acid                                                         
## residual.sugar                                                      
## chlorides                                                           
## free.sulfur.dioxide                                                 
## total.sulfur.dioxide                                                
## density                           0.01028                           
## pH                                0.01429  0.01416                  
## sulphates                         0.01403  0.01421 0.01394          
## alcohol                           0.01141 0.005594 0.01408   0.01429
## quality                           0.01385  0.01294 0.01415   0.01425
##                      alcohol
## X                           
## fixed.acidity               
## volatile.acidity            
## citric.acid                 
## residual.sugar              
## chlorides                   
## free.sulfur.dioxide         
## total.sulfur.dioxide        
## density                     
## pH                          
## sulphates                   
## alcohol                     
## quality              0.01158
## 
## n = 4898 
## 
## P-values for Tests of Bivariate Normality:
##                               X fixed.acidity volatile.acidity citric.acid
## X                                                                         
## fixed.acidity        1.384e-135                                           
## volatile.acidity       4.43e-79     8.326e-51                             
## citric.acid          8.099e-177    7.094e-126        3.11e-162            
## residual.sugar       1.269e-153    3.961e-142       3.871e-146  6.704e-208
## chlorides                     0             0                0           0
## free.sulfur.dioxide   2.436e-59     9.489e-44        2.307e-50  1.481e-110
## total.sulfur.dioxide  4.165e-65     1.731e-38        3.649e-49  2.145e-108
## density              6.906e-101     2.053e-49        1.458e-45  1.894e-132
## pH                    2.823e-57     5.114e-36        2.379e-36  3.439e-101
## sulphates             1.308e-56     1.076e-33        4.068e-33  4.195e-103
## alcohol              3.053e-105     1.172e-74        1.458e-96  6.265e-186
## quality                       0             0                0           0
##                      residual.sugar chlorides free.sulfur.dioxide
## X                                                                
## fixed.acidity                                                    
## volatile.acidity                                                 
## citric.acid                                                      
## residual.sugar                                                   
## chlorides                         0                              
## free.sulfur.dioxide      2.279e-119         0                    
## total.sulfur.dioxide     9.659e-122         0           2.231e-30
## density                   3.89e-196         0           1.384e-52
## pH                       1.085e-119         0           3.012e-24
## sulphates                2.257e-116         0            1.06e-18
## alcohol                  3.624e-202         0           9.643e-71
## quality                           0         0                   0
##                      total.sulfur.dioxide    density        pH sulphates
## X                                                                       
## fixed.acidity                                                           
## volatile.acidity                                                        
## citric.acid                                                             
## residual.sugar                                                          
## chlorides                                                               
## free.sulfur.dioxide                                                     
## total.sulfur.dioxide                                                    
## density                         1.193e-28                               
## pH                              3.591e-17  1.448e-34                    
## sulphates                       6.053e-32  1.796e-35 1.473e-17          
## alcohol                         2.343e-57 3.223e-108 2.598e-62 3.961e-84
## quality                                 0          0         0         0
##                      alcohol
## X                           
## fixed.acidity               
## volatile.acidity            
## citric.acid                 
## residual.sugar              
## chlorides                   
## free.sulfur.dioxide         
## total.sulfur.dioxide        
## density                     
## pH                          
## sulphates                   
## alcohol                     
## quality                    0

The Pearson correlations between quality and the other features are quite low. Alcohol and quality have the highest Pearson R, at 0.4356. Other features that might influence quality are fixed acidity, volatile acidity, residual sugar, chlorides, total sulfur dioxide, density and sulphates.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

From the scatter plot of alcohol against quality we can see that wines of quality 5, 6 and 7 span alcohol contents from 8 to 13 percent. Lower quality wines (5 and below) have alcohol content predominantly between 8 and 11 percent.
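The test statistic and confidence interval printed for alcohol and quality follow from the correlation and the sample size alone. The sketch below recomputes them with the standard formulas (t statistic with n - 2 degrees of freedom, and a Fisher z interval); it starts from the rounded `cor` value in the output, so agreement is only to a few decimals.

```r
r <- 0.4355747   # sample correlation from the output above
n <- 4898        # number of wines

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)   # compare with t = 33.858
print(ci)       # compare with 0.4126015 0.4579941
```

The narrow interval reflects the large sample: the correlation is moderate, but it is estimated precisely.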

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

Density and alcohol have a negative correlation, with a Pearson R of -0.7801. Wines with higher alcohol content have lower density.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

Residual sugar and density have a positive correlation of 0.839.

As expected, the relationship between density and quality is the opposite of that between alcohol and quality. For instance, lower quality wines predominantly have higher density and lower alcohol content.

It is difficult to establish any relationship between fixed acidity and quality. Volatile acidity has a lot of outliers, and even after removing them I cannot establish any relationship between quality and volatile acidity. All quality levels have similar distributions of volatile acidity and fixed acidity.

Residual sugar has a lot of outliers, so only values up to the 99th percentile are included in the analysis. From the scatter plot we can infer that wines of quality 4 or below predominantly have lower residual sugar. Wines of quality 5 and above have similar residual sugar distributions to one another. This is a surprise: I was expecting a pattern similar to that of density and quality, but it resembles that of alcohol and quality.
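Keeping only values up to the 99th percentile can be done with `quantile()`. The helper name `trim_upper` and the toy data frame are hypothetical; the report's actual filtering code is not shown.

```r
# Keep rows whose value in `column` is at or below the p-th quantile.
trim_upper <- function(df, column, p = 0.99) {
  cutoff <- quantile(df[[column]], probs = p, na.rm = TRUE)
  df[df[[column]] <= cutoff, , drop = FALSE]
}

# Toy data (made up): 99 ordinary values and one extreme one.
toy <- data.frame(residual.sugar = c(1:99, 1000))
trimmed <- trim_upper(toy, "residual.sugar")

nrow(trimmed)                  # the extreme row is dropped
max(trimmed$residual.sugar)
```

Applied to the wine data, the call would be `trim_upper(wine, "residual.sugar")`, assuming the data frame is named `wine` as in the test output.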

Chlorides have some outliers, so only values up to the 99th percentile are considered. Middle quality wines cover a wide range of chloride values, whereas lower and higher quality wines have lower chloride values. This may be because the middle qualities have more samples, and hence their spread is quite high.

No relationship could be established between quality and total sulfur dioxide.

Wines of different quality have similar distributions of sulphates.

Wines of quality 6 and above have a higher median alcohol value, and the median alcohol value increases with quality from 6 upward. As expected, density shows exactly the opposite trend.

I could not find any definite pattern in the boxplots of quality against residual sugar, total sulfur dioxide or sulphates.

Lower quality wines have a higher median chloride content than higher quality wines.
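The per-quality medians that the boxplots display can also be read off numerically with `tapply()`. The toy values below are invented purely to show the call; they are not taken from the wine data.

```r
# Toy data (made up) with two wines per quality level.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.0, 9.4, 10.0, 10.6, 11.5, 12.1)
)

# Median alcohol for each quality level
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
print(median_by_quality)

# The matching boxplot, if a graphics device is available:
# boxplot(alcohol ~ quality, data = toy)
```

On the full data the same call would be `tapply(wine$alcohol, wine$quality, median)`, and likewise for chlorides or density.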

Bivariate Data Analysis

What relationships between features were observed in the bivariate analysis?

There is a relationship between quality and alcohol, with a correlation coefficient of 0.4356: higher quality wines tend to have higher alcohol content. Wines of quality 7, 8 and 9 have a higher median alcohol content than lower quality wines.

The next feature that influences wine quality is density, with a correlation of -0.3071. Density is a physical property that depends on the other chemicals present in the wine; in our case it is affected by alcohol and residual sugar. I strongly suspect that density does not affect wine quality much by itself, since it is driven by those other chemicals.

Chlorides have a negative correlation with wine quality, with a correlation coefficient of -0.2099. Wines of quality 7, 8 and 9 have a lower median chloride content than lower quality wines.

The other features that seem to have an effect on wine quality are fixed acidity, volatile acidity, residual sugar, total sulfur dioxide and sulphates. Further analysis is needed to determine the relationship between these features and wine quality.

Did you find any relationships between features other than the main feature(s) of interest?

Density correlates strongly with residual sugar; the correlation coefficient is 0.839.

There is a strong negative correlation between alcohol and density (-0.7801): the higher the percentage of alcohol, the lower the density.

Which features have the strongest relationship with the feature of interest?

No feature correlates strongly with quality. Alcohol has the strongest correlation, a moderate positive one of 0.4356. Density has a negative correlation of -0.3071, and chlorides have a weaker negative correlation of -0.2099.

Multivariate Analysis

## 
##   high    low medium 
##   1060   1640   2198

The wines are divided into three categories: low, medium and high. A quality below 6 is categorized as low, a quality of 6 as medium, and a quality above 6 as high.
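This binning can be written with `cut()`. The sketch below rebuilds the quality column from the frequency table printed in the univariate section and checks that the three categories reproduce the counts above; the source names the new column `qualityLabel` but does not show how it was built, so the exact call is an assumption.

```r
# Rebuild the quality scores from the frequency table (3..9 with their counts).
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# (-Inf, 5] -> low, (5, 6] -> medium, (6, Inf) -> high
quality_label <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("low", "medium", "high"))

label_counts <- table(quality_label)
print(label_counts)
```

Here the levels are ordered low, medium, high. The table in the report lists them alphabetically (high, low, medium), which suggests the original column was a plain character vector or a factor with default level order.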

Low quality wines mostly have alcohol values of 11 or lower and volatile acidity in the range 0.2 to 0.6. Medium quality wines predominantly have alcohol content of 11 or lower and volatile acidity in the range 0.1 to 0.5. High quality wines predominantly have alcohol content of 11 or higher and volatile acidity in the range 0.1 to 0.5. This behavior is expected, given the positive correlation between alcohol and quality and the negative correlation between volatile acidity and quality.

Low and medium quality wines mostly have fixed acidity values from 5 to 8.5 and alcohol content below 11, whereas high quality wines mostly have fixed acidity values from 5 to 7.5 and alcohol content above 11. This is consistent with our correlation analysis.

Low and medium quality wines mostly have residual sugar values from 2 to 20 and alcohol content below 11, whereas high quality wines mostly have residual sugar values from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis.

Low and medium quality wines mostly have free sulfur dioxide values from 10 to 60 and alcohol content below 11, whereas high quality wines mostly have free sulfur dioxide values from 25 to 50 and alcohol content above 11. This explains the weak correlation between wine quality and free sulfur dioxide.

Low and medium quality wines mostly have sulphates values from 0.3 to 0.6 and alcohol content below 11, whereas high quality wines mostly have sulphates values from 0.25 to 0.7 and alcohol content above 11. This is consistent with our correlation analysis.

Chlorides and free sulfur dioxide, viewed across different alcohol contents, do not seem to have any effect on quality.

##             alcohol    volatile.acidity             density 
##            5.142229            1.041865           16.008081 
##       fixed.acidity      residual.sugar free.sulfur.dioxide 
##            1.406153            7.233439            1.147541 
##           sulphates 
##            1.125417
##             alcohol    volatile.acidity       fixed.acidity 
##            1.303170            1.029825            1.026391 
##      residual.sugar free.sulfur.dioxide           sulphates 
##            1.346054            1.147219            1.006988

Since density is affected by alcohol and residual sugar, the variance inflation factor (VIF) of each candidate variable has to be checked before fitting the linear regression. Density has a high VIF (16.0); after removing density, the VIFs of the remaining variables are acceptable (all below 1.4).
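For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The report does not show which function produced the values above (a helper such as `car::vif` would give output of this shape), so the base R sketch below is an assumption, demonstrated on simulated collinear data rather than the wine data.

```r
# VIF of each column: regress it on the other columns and use 1 / (1 - R^2).
vif_base <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(42)
a <- rnorm(200)
b <- rnorm(200)
# Simulated data (made up): `mix` is almost exactly a + b, so it is collinear.
sim <- data.frame(a = a, b = b, mix = a + b + rnorm(200, sd = 0.1))

vifs <- vif_base(sim)
print(vifs)

# Reading the report's value for density the other way round:
# a VIF of 16.008 means the other predictors explain about 94% of its variance.
density_r2 <- 1 - 1 / 16.008081
print(density_r2)
```

This is why dropping density brings the VIFs of alcohol and residual sugar down from 5.1 and 7.2 to about 1.3.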

Thus we will use alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide and sulphates to build a linear model for quality.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "qualityLabel"
## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity, 
##     data = wine)
## m4: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar, data = wine)
## m5: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide, data = wine)
## m6: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide + pH, data = wine)
## m7: lm(formula = quality ~ alcohol + volatile.acidity + fixed.acidity + 
##     residual.sugar + free.sulfur.dioxide + sulphates, data = wine)
## 
## ====================================================================================================
##                           m1         m2         m3         m4         m5         m6         m7      
## ----------------------------------------------------------------------------------------------------
##   (Intercept)           2.582***   3.017***   3.548***   2.919***   2.663***   1.901***   2.446***  
##                        (0.098)    (0.098)    (0.141)    (0.150)    (0.157)    (0.338)    (0.164)    
##   alcohol               0.313***   0.324***   0.319***   0.370***   0.377***   0.377***   0.378***  
##                        (0.009)    (0.009)    (0.009)    (0.010)    (0.010)    (0.010)    (0.010)    
##   volatile.acidity                -1.979***  -1.988***  -2.119***  -2.052***  -2.043***  -2.040***  
##                                   (0.110)    (0.109)    (0.109)    (0.109)    (0.109)    (0.109)    
##   fixed.acidity                              -0.068***  -0.074***  -0.068***  -0.052***  -0.067***  
##                                              (0.013)    (0.013)    (0.013)    (0.014)    (0.013)    
##   residual.sugar                                         0.027***   0.024***   0.025***   0.025***  
##                                                         (0.002)    (0.002)    (0.003)    (0.002)    
##   free.sulfur.dioxide                                               0.004***   0.004***   0.004***  
##                                                                    (0.001)    (0.001)    (0.001)    
##   pH                                                                           0.205*               
##                                                                               (0.081)               
##   sulphates                                                                               0.412***  
##                                                                                          (0.095)    
## ----------------------------------------------------------------------------------------------------
##   R-squared                 0.2        0.2        0.2        0.3        0.3        0.3        0.3   
##   adj. R-squared            0.2        0.2        0.2        0.3        0.3        0.3        0.3   
##   sigma                     0.8        0.8        0.8        0.8        0.8        0.8        0.8   
##   F                      1146.4      773.9      527.7      437.6      358.3      300.0      302.8   
##   p                         0.0        0.0        0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood        -5839.4    -5681.8    -5668.2    -5605.7    -5590.5    -5587.2    -5581.1   
##   Deviance               3112.3     2918.3     2902.2     2829.0     2811.5     2807.8     2800.7   
##   AIC                   11684.8    11371.6    11346.4    11223.4    11195.0    11190.5    11178.2   
##   BIC                   11704.3    11397.5    11378.9    11262.4    11240.5    11242.5    11230.2   
##   N                      4898       4898       4898       4898       4898       4898       4898     
## ====================================================================================================

From the table above we can see that every feature added to the linear model is statistically significant. The number of stars next to each coefficient indicates its significance level: no star means not significant and three stars mean the highest significance. Every coefficient has three stars except pH in m6, which has one.
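Because the table rounds R-squared to one decimal, the models look identical on that row. A finer value can be recovered from the Deviance row, which for a linear model is the residual sum of squares, together with the alcohol and quality correlation (for the one-predictor model m1, R² equals r²). The AIC row can be checked the same way from the log-likelihoods. This is a sketch that re-enters the printed numbers by hand, so results are limited by their rounding.

```r
models <- paste0("m", 1:7)

# Rows copied from the model comparison table above.
deviance_rss <- setNames(c(3112.3, 2918.3, 2902.2, 2829.0, 2811.5, 2807.8, 2800.7), models)
log_lik      <- setNames(c(-5839.4, -5681.8, -5668.2, -5605.7, -5590.5, -5587.2, -5581.1), models)
aic_table    <- setNames(c(11684.8, 11371.6, 11346.4, 11223.4, 11195.0, 11190.5, 11178.2), models)
n_coef       <- setNames(c(2, 3, 4, 5, 6, 7, 7), models)   # coefficients incl. intercept

# Total sum of squares from m1: RSS = TSS * (1 - r^2) for a single predictor.
r_alcohol <- 0.4355747
tss <- deviance_rss[["m1"]] / (1 - r_alcohol^2)

# R-squared for every model, to more digits than the table shows
r_squared <- 1 - deviance_rss / tss
print(round(r_squared, 3))

# AIC = -2 * logLik + 2 * (coefficients + 1 for the residual variance)
aic_check <- -2 * log_lik + 2 * (n_coef + 1)
print(round(aic_check - aic_table, 2))   # differences are rounding noise
```

Seen this way, R-squared climbs from about 0.19 for m1 to about 0.27 for m7, and each added term lowers the AIC.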

What new features were introduced for the multivariate analysis?

Wine quality is grouped into three categories: low, medium and high. A quality below 6 is categorized as low, a quality of 6 as medium, and a quality above 6 as high.

What observations were made in your analysis? Did the features strengthen each other in terms of the feature of interest?

Alcohol content, together with volatile acidity, fixed acidity, residual sugar, free sulfur dioxide, pH and sulphates, seems to have an effect on the quality of wine.

High alcohol combined with lower fixed acidity, volatile acidity and residual sugar is associated with high wine quality. High alcohol combined with high free sulfur dioxide, pH and sulphates is also associated with high wine quality.

Is there any surprising finding in your analysis?

I was expecting chlorides and free sulfur dioxide, in combination with alcohol, to show some relationship with wine quality. I was surprised to find that they do not.

Did you create any models with your dataset? Discuss the details of the model.

Yes. I built a sequence of linear models (m1 to m7). The final model, m7, uses alcohol, volatile acidity, fixed acidity, residual sugar, free sulfur dioxide and sulphates; m6 uses pH in place of sulphates.

The model accounts for a little under 30% of the variance in wine quality (R-squared is shown rounded to 0.3 in the table). Density was left out of the model because of its high variance inflation factor. The variance explained is not very high, so predictions of wine quality based on this model may not be reliable.
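To see what such a model predicts, the m7 coefficients from the table can be applied by hand to the first sample shown in the `str()` output (alcohol 8.8, volatile acidity 0.27, fixed acidity 7, residual sugar 20.7, free sulfur dioxide 45, sulphates 0.45, observed quality 6). The coefficients are the rounded values printed in the table, so this is only an approximate sketch of what `predict()` on the fitted model would return.

```r
# m7 coefficients as printed (rounded) in the model table.
coef_m7 <- c(`(Intercept)`       = 2.446,
             alcohol             = 0.378,
             volatile.acidity    = -2.040,
             fixed.acidity       = -0.067,
             residual.sugar      = 0.025,
             free.sulfur.dioxide = 0.004,
             sulphates           = 0.412)

# First wine in the data set, values taken from the str() output.
sample1 <- c(alcohol = 8.8, volatile.acidity = 0.27, fixed.acidity = 7,
             residual.sugar = 20.7, free.sulfur.dioxide = 45, sulphates = 0.45)

# Linear predictor: intercept plus the sum of coefficient * value.
predicted_quality <- coef_m7[["(Intercept)"]] +
  sum(coef_m7[names(sample1)] * sample1)

print(predicted_quality)   # about 5.64, against an observed quality of 6
```

With a residual standard error of about 0.8 (the sigma row), an error of this size is typical for the model.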

Final Plots and Summary

Plot One

Description One

Most of the wines in this dataset have a quality of 5, 6 or 7. There is no wine with a quality below 3 and none with a quality of 10.

Plot Two

Description Two

Alcohol has the highest (positive) correlation with quality, at 0.4356 (about 44%), so alcohol provides more information about wine quality than the other chemical features. The regression line also confirms the positive correlation of alcohol with quality.

From the scatter plot of alcohol against quality we can see that wines of quality 5, 6 and 7 span alcohol contents from 8 to 13 percent. This is because these quality levels have many more samples than the others.

Lower quality wines (5 and below) have alcohol content predominantly between 8 and 11 percent. Wines of quality 6 and above have a higher median alcohol value, and the median increases with quality from 6 upward.

Plot Three

Description Three

Residual sugar has the strongest (negative) correlation with alcohol among the chemicals that affect quality, at -0.4506 (about 45%), so we can compare how residual sugar relates to alcohol, as well as how both of them in turn relate to wine quality. The regression lines also confirm the negative correlation between alcohol and residual sugar.

For the above plot we only consider alcohol values from 8 to 14 and residual sugar values from 0 to 20, as most of the wine samples fall in this range. Wines of quality 5, 6 and 7 dominate the plot.

Low and medium quality wines mostly have residual sugar values from 2 to 20 and alcohol content below 11, whereas high quality wines mostly have residual sugar values from 2 to 13 and alcohol content above 11. This is consistent with our correlation analysis of quality with alcohol and residual sugar.

Reflection

There are 4898 wine samples in the dataset. I started by exploring the dataset one variable at a time. Later I formulated some questions and explored some interesting features in the dataset. Finally I explored the relationship between wine quality and the other chemical features.

Wine quality has a positive correlation with alcohol, free sulfur dioxide, pH and sulphates. On the other hand, wine quality has a negative correlation with density, chlorides, fixed acidity, volatile acidity and residual sugar. Further analysis showed that free sulfur dioxide and chlorides have relatively low influence on wine quality. I then created linear models using alcohol, volatile acidity, fixed acidity, residual sugar and free sulfur dioxide, adding either pH (m6) or sulphates (m7). The final model accounts for a little under 30% of the variance in wine quality. Density was left out of the model because of its high variance inflation factor. The variance explained is not very high, so predictions of wine quality based on these features may not be reliable.

The main problem I faced in this analysis is that none of the chemicals had a strong correlation with wine quality. Alcohol had the highest correlation, at 0.4356 (about 44%). In the univariate and bivariate analyses I found it difficult to establish relationships between many of the features and wine quality. I had to use different multivariate plots and categorize quality into three groups before I started seeing relationships between features and quality. I also needed to compute variance inflation factors to filter out collinear features.

The main drawback of the dataset is that the wine count for some quality levels is quite low. There is no wine with a quality below 3 and none with a quality of 10. Furthermore, quality levels 3, 4, 8 and 9 have only 20, 163, 175 and 5 samples respectively, whereas quality levels 5, 6 and 7 account for 1457, 2198 and 880 samples respectively. A dataset with wine counts spread more evenly across quality levels would make the analysis more reliable and a predictive model more accurate.
